Job Launcher and Job Handle: design doc/implementation/unit tests (TA: NVFlare developers)#4336
Greptile Summary
This PR introduces
Key changes:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant UL as Upper Layer<br/>(ServerEngine/ClientExecutor)
participant EH as Event System
participant KL as K8sJobLauncher
participant KH as K8sJobHandle
participant K8s as Kubernetes API
UL->>EH: fire BEFORE_JOB_LAUNCH event
EH->>KL: handle_event(BEFORE_JOB_LAUNCH, fl_ctx)
KL->>KL: extract_job_image(job_meta, site_name)
alt job_image present
KL->>EH: add_launcher(self, fl_ctx)
end
UL->>KL: launch_job(job_meta, fl_ctx)
KL->>KL: uuid4_to_rfc1123(raw_job_id)
KL->>KL: get_module_args(job_id, fl_ctx)
KL->>KH: K8sJobHandle(job_id, core_v1, job_config)
KH->>KH: _make_manifest(job_config)
KH-->>KL: job_handle
KL->>K8s: create_namespaced_pod(pod_manifest)
alt pod creation fails
K8s-->>KL: Exception
KL->>KH: terminal_state = TERMINATED
KL-->>UL: job_handle (terminated)
else pod created
K8s-->>KL: ok
KL->>KH: enter_states([RUNNING])
loop poll until RUNNING or timeout
KH->>K8s: read_namespaced_pod(job_id)
K8s-->>KH: pod phase
alt stuck in PENDING
KH->>K8s: delete_namespaced_pod(job_id)
KH->>KH: terminal_state = TERMINATED
end
end
KH-->>KL: True/False
KL-->>UL: job_handle
end
UL->>KH: wait()
loop until terminal
KH->>K8s: read_namespaced_pod(job_id)
K8s-->>KH: phase (SUCCEEDED/FAILED/RUNNING)
alt SUCCEEDED or FAILED
KH->>KH: terminal_state = job_state
end
end
UL->>KH: poll()
KH-->>UL: JobReturnCode (SUCCESS/ABORTED/UNKNOWN)
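The handle life cycle in the diagram can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `SketchJobHandle`, `FakeCoreV1`, the `JobState` enum, the `max_polls` bound (standing in for the timeout), and the phase-to-return-code mapping are all hypothetical. Only the names `enter_states`, `wait`, `poll`, `terminal_state`, `read_namespaced_pod`, and `delete_namespaced_pod` come from the diagram.

```python
import time
from enum import Enum
from types import SimpleNamespace


class JobState(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    TERMINATED = "Terminated"  # set locally by the handle, not a pod phase
    UNKNOWN = "Unknown"


class JobReturnCode(Enum):  # stand-in for the real return-code type
    SUCCESS = "SUCCESS"
    ABORTED = "ABORTED"
    UNKNOWN = "UNKNOWN"


# Assumed mapping; the diagram only lists the three possible return codes.
RETURN_CODES = {
    JobState.SUCCEEDED: JobReturnCode.SUCCESS,
    JobState.FAILED: JobReturnCode.ABORTED,
    JobState.TERMINATED: JobReturnCode.ABORTED,
}


class FakeCoreV1:
    """Stub for the Kubernetes core API that replays scripted pod phases."""

    def __init__(self, phases):
        self._phases = list(phases)
        self.deleted = []

    def read_namespaced_pod(self, name, namespace):
        # Consume phases in order; keep returning the last one once exhausted.
        phase = self._phases.pop(0) if len(self._phases) > 1 else self._phases[0]
        return SimpleNamespace(status=SimpleNamespace(phase=phase))

    def delete_namespaced_pod(self, name, namespace):
        self.deleted.append(name)


class SketchJobHandle:
    def __init__(self, job_id, api, namespace="default", interval=0.0):
        self.job_id = job_id
        self.api = api
        self.namespace = namespace
        self.interval = interval
        self.terminal_state = None

    def _state(self):
        # Once a terminal state is recorded, stop querying the API.
        if self.terminal_state is not None:
            return self.terminal_state
        pod = self.api.read_namespaced_pod(name=self.job_id, namespace=self.namespace)
        try:
            return JobState(pod.status.phase)
        except ValueError:
            return JobState.UNKNOWN

    def enter_states(self, states, max_polls=5):
        """Poll until the pod reaches one of `states`; give up after max_polls."""
        for _ in range(max_polls):
            if self._state() in states:
                return True
            time.sleep(self.interval)
        # Treated as stuck: delete the pod and mark the handle terminated.
        self.api.delete_namespaced_pod(name=self.job_id, namespace=self.namespace)
        self.terminal_state = JobState.TERMINATED
        return False

    def wait(self):
        """Block until the pod reports SUCCEEDED or FAILED."""
        while self.terminal_state is None:
            state = self._state()
            if state in (JobState.SUCCEEDED, JobState.FAILED):
                self.terminal_state = state
            else:
                time.sleep(self.interval)

    def poll(self):
        return RETURN_CODES.get(self._state(), JobReturnCode.UNKNOWN)
```

With a real cluster, `FakeCoreV1` would be replaced by the Kubernetes client object passed to the handle as `core_v1`, and the poll bound by a wall-clock timeout.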
Last reviewed commit: "Fix reference error"
Pull request overview
Adds documentation, Kubernetes-oriented job launching support, and unit tests around the new JobLauncher/JobHandle abstractions, plus “best effort” resource management utilities intended for orchestration backends.
Changes:
- Add detailed design documentation for JobLauncherSpec/JobHandleSpec and the Process/Docker/K8s implementations.
- Extend the Kubernetes launcher/handle to build PVC-backed pod manifests, support GPU limits, and add timeout/stuck-handling logic.
- Introduce BE (best-effort) resource manager/consumer utilities and expand GPUResourceManager with an ignore_host option.
- Add unit tests for K8s and Docker job launchers/handles.
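For readers unfamiliar with the pieces named above, here is a hedged sketch of what a PVC-backed pod manifest with a GPU limit can look like. The function `build_pod_manifest`, its parameters, the container name, and the volume naming scheme are hypothetical and are not taken from the PR; the manifest fields themselves (`persistentVolumeClaim.claimName`, `volumeMounts`, and the `nvidia.com/gpu` resource limit) are standard Kubernetes pod fields.

```python
def build_pod_manifest(job_id, image, command, pvc_mounts, num_gpus=0):
    """Build a pod manifest as a plain dict.

    pvc_mounts maps a PersistentVolumeClaim name to a mount path
    inside the container (hypothetical parameter shape).
    """
    volumes = []
    mounts = []
    for index, (claim_name, mount_path) in enumerate(sorted(pvc_mounts.items())):
        volume_name = f"vol-{index}"
        # Each PVC becomes one pod volume plus one mount in the container.
        volumes.append(
            {"name": volume_name, "persistentVolumeClaim": {"claimName": claim_name}}
        )
        mounts.append({"name": volume_name, "mountPath": mount_path})

    container = {
        "name": "job",
        "image": image,
        "command": list(command),
        "volumeMounts": mounts,
    }
    if num_gpus > 0:
        # GPUs are requested through resource limits on the container.
        container["resources"] = {"limits": {"nvidia.com/gpu": num_gpus}}

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": job_id},
        "spec": {
            "restartPolicy": "Never",  # a job pod should not be restarted in place
            "containers": [container],
            "volumes": volumes,
        },
    }


manifest = build_pod_manifest(
    job_id="job-abc",
    image="example/image:latest",
    command=["python", "-m", "worker"],
    pvc_mounts={"workspace-pvc": "/workspace"},
    num_gpus=1,
)
```

A dict of this shape can be passed as the `body` argument when creating a pod through the Kubernetes Python client; the pod name has to be a valid RFC 1123 label, which is presumably why the launcher normalizes the job ID first.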
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| docs/design/JobLauncher_and_JobHandle.md | New design doc describing the launcher/handle architecture and backends. |
| nvflare/apis/job_launcher_spec.py | Docstring updates to clarify expected return values. |
| nvflare/app_opt/job_launcher/k8s_launcher.py | Major updates to K8s pod manifest creation, lifecycle handling, and resource/PVC wiring. |
| nvflare/private/fed/client/communicator.py | Adds configurable timeout for waiting on cell creation during client registration. |
| nvflare/app_common/resource_managers/gpu_resource_manager.py | Adds ignore_host option to skip host GPU validation. |
| nvflare/app_common/resource_managers/BE_resource_manager.py | New “best effort” resource manager implementation. |
| nvflare/app_common/resource_consumers/BE_resource_consumer.py | New no-op resource consumer. |
| tests/unit_test/app_opt/job_launcher/k8s_launcher_test.py | New unit tests for K8s job handle/launcher behavior. |
| tests/unit_test/app_opt/job_launcher/docker_launcher_test.py | New unit tests for Docker job handle/launcher behavior. |
| tests/unit_test/app_opt/job_launcher/__init__.py | Test package init. |
8607f8a to 5ac3bf1
5ac3bf1 to 7eb5b0e
7eb5b0e to 1048b0b
de9d011 to 9708848
9708848 to 4138a23
Job Launcher for K8s environment GPU, image, PVC updated and working Add codes Add unit tests
4138a23 to fb21e24
102c57e to a1f1228
34fee96 to 59124fb
/build
a21c83d to 27a8b46
Address comments
27a8b46 to 2d9eb0f
/build
Description
Job Launcher and Job Handle design docs
Implementation of Job Launcher and Job Handle for Kubernetes (Docker is WIP)
Unit tests for implementation
Types of changes
./runtest.sh.